Abstract. UCT, a state-of-the-art algorithm for Monte Carlo tree
search (MCTS) in games and Markov decision processes, is based
on UCB1, a sampling policy for the Multi-armed Bandit problem
(MAB) that minimizes the cumulative regret. However, search differs
from MAB in that in MCTS it is usually only the final "arm
pull" (the actual move selection) that collects a reward, rather than
all "arm pulls". In this paper, an MCTS sampling policy based on
Value of Information (VOI) estimates of rollouts is suggested. Empirical
evaluation of the policy and comparison to UCB1 and UCT is
performed on random MAB instances as well as on Computer Go.

1 Introduction 

MCTS, and especially UCT [9], appears in numerous search applications,
such as [4]. Although these methods are shown to be successful
empirically, most authors appear to be using UCT "because it
has been shown to be successful in the past", and "because it does a
good job of trading off exploration and exploitation". While the latter
statement may be correct for the Multi-armed Bandit problem and
for the UCB1 algorithm [1], we argue that a simple reconsideration
from basic principles can result in schemes that outperform UCT.

The core issue is that in MCTS for adversarial search and search in 
"games against nature" the goal is typically to find the best first action 
of a good (or even optimal) policy, which is closer to minimizing the 
simple regret, rather than the cumulative regret minimized by UCB1.
However, the simple and the cumulative regret cannot be minimized
simultaneously; moreover, [3] shows that in many cases the smaller
the cumulative regret, the greater the simple regret. 

We begin with background definitions and related work. VOI es- 
timates for arm pulls in MAB are presented, and a VOI-aware sam- 
pling policy is suggested, both for the simple regret in MAB and for 
MCTS. Finally, the performance of the proposed sampling policy is 
evaluated on sets of Bernoulli arms and on Computer Go, showing
improved performance.

2 Background and Related Work 

Monte-Carlo tree search was initially suggested as a scheme for 
finding approximately optimal policies for Markov Decision Pro- 
cesses (MDP). MCTS explores an MDP by performing rollouts — 
trajectories from the current state to a state in which a termination
condition is satisfied (either the goal or a cutoff state). 

Taking a sequence of samples in order to minimize the regret of a
decision based on the samples is captured by the Multi-armed Bandit
problem (MAB) [11]. In MAB, we have a set of $K$ arms. Each
arm can be pulled multiple times. When the $i$th arm is pulled, a
random reward $X_i$ from an unknown stationary distribution is encountered.
In the cumulative setting, all encountered rewards are collected.
UCB1 [1] was shown to be near-optimal in this respect. UCT,
an extension of UCB1 to MCTS, is described in [9], and shown to
outperform many state-of-the-art search algorithms in both MDP and
adversarial search [5]. In the simple regret setting, the agent gets
to collect only the reward of the last pull.
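
For reference, the arm selection rule of UCB1 can be sketched as follows. This is a minimal illustration of the standard index from [1] (sample mean plus an exploration bonus of $\sqrt{2 \ln n / n_i}$), assuming rewards in [0, 1]; the function name `ucb1_select` and its arguments are hypothetical and are not taken from any implementation discussed in this paper.

```python
import math

def ucb1_select(counts, means):
    """Return the index of the arm that UCB1 would pull next.

    counts[i] is the number of pulls of arm i so far,
    means[i] is the sample mean of its rewards (assumed to lie in [0, 1]).
    """
    # Every arm is pulled once before the index is used.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    total = sum(counts)
    # Index: sample mean plus exploration bonus sqrt(2 ln(total) / n_i).
    return max(
        range(len(counts)),
        key=lambda i: means[i] + math.sqrt(2.0 * math.log(total) / counts[i]),
    )
```

Because the bonus shrinks only as an arm is pulled, rarely sampled arms keep being revisited, which suits the cumulative setting but not necessarily the simple regret setting.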

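Definition 1. The simple regret of a sampling policy for MAB is
the expected difference between the best expected reward $\mu_*$ and the
expected reward $\mu_j$ of the empirically best arm $\overline{X}_j = \max_i \overline{X}_i$:

$$\mathbb{E}r = \sum_{j=1}^{K} \Delta_j \Pr\left(\overline{X}_j = \max_i \overline{X}_i\right) \qquad (1)$$

where $\Delta_j = \mu_* - \mu_j$.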
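
As a small worked example of Definition 1, the sketch below evaluates the sum in (1) for given expected rewards and given probabilities of each arm ending up as the empirically best one. The function name `simple_regret` and the numbers are illustrative assumptions only.

```python
def simple_regret(mu, select_prob):
    """Simple regret of Definition 1.

    mu[j] is the expected reward of arm j; select_prob[j] is the
    probability that arm j has the highest sample mean when sampling stops.
    """
    best = max(mu)
    # Sum over arms of Delta_j * Pr(arm j is empirically best).
    return sum((best - m) * p for m, p in zip(mu, select_prob))

# Two arms with expected rewards 0.9 and 0.5; the worse arm is selected
# with probability 0.2, so the regret is 0.4 * 0.2, approximately 0.08.
example = simple_regret([0.9, 0.5], [0.8, 0.2])
```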

Strategies that minimize the simple regret are called pure exploration
strategies [3].

A different scheme for control of sampling can use the principles
of bounded rationality [8] and rational metareasoning [10, 6]. In
search, one maintains a current best action $\alpha$, and estimates the expected
gain from finding another action $\beta$ to be better than the current best.

3 Upper Bounds on Value of Information 

The intrinsic VOI $\Lambda_i$ of pulling an arm is the expected decrease in the
regret compared to selecting the best arm without pulling any arm at
all. Two cases are possible:

• the arm $\alpha$ with the highest sample mean $\overline{X}_\alpha$ is pulled, and $\overline{X}_\alpha$
becomes lower than $\overline{X}_\beta$ of the second-best arm $\beta$;

• another arm $i$ is pulled, and $\overline{X}_i$ becomes higher than $\overline{X}_\alpha$.

The myopic VOI estimate is of limited applicability to Monte-Carlo 
sampling, since the effect of a single sample is small, and the myopic 
VOI estimate will often be zero. However, for the common case of a 
fixed budget of samples per node, $\Lambda_i$ can be estimated as the intrinsic
VOI $\Lambda_i^b$ of pulling the $i$th arm for the rest of the budget. Let us denote
the current number of samples of the $i$th arm by $n_i$, and the remaining
number of samples by $N$:

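Theorem 1. $\Lambda_i^b$ is bounded from above as

$$\begin{aligned}
\Lambda_\alpha^b &\le \frac{N \overline{X}_\beta^{n_\beta}}{N + n_\alpha} \Pr\left(\overline{X}_\alpha^{n_\alpha+N} \le \overline{X}_\beta^{n_\beta}\right) \le \frac{N \overline{X}_\beta^{n_\beta}}{n_\alpha} \Pr\left(\overline{X}_\alpha^{n_\alpha+N} \le \overline{X}_\beta^{n_\beta}\right) \\
\Lambda_{i|i\ne\alpha}^b &\le \frac{N \left(1 - \overline{X}_\alpha^{n_\alpha}\right)}{N + n_i} \Pr\left(\overline{X}_i^{n_i+N} \ge \overline{X}_\alpha^{n_\alpha}\right) \le \frac{N \left(1 - \overline{X}_\alpha^{n_\alpha}\right)}{n_i} \Pr\left(\overline{X}_i^{n_i+N} \ge \overline{X}_\alpha^{n_\alpha}\right)
\end{aligned} \qquad (2)$$

where $\overline{X}_i^{n_i+N}$ is the sample mean of the $i$th arm after $n_i + N$ samples.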

The probabilities can be bounded from above using the Hoeffding 
inequality [7]:



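Theorem 2. The probabilities in equations (2) are bounded from
above as

$$\begin{aligned}
\Pr\left(\overline{X}_\alpha^{n_\alpha+N} \le \overline{X}_\beta^{n_\beta}\right) &\le 2\exp\left(-\varphi(n_\alpha)\left(\overline{X}_\alpha^{n_\alpha} - \overline{X}_\beta^{n_\beta}\right)^2 n_\alpha\right) \\
\Pr\left(\overline{X}_{i|i\ne\alpha}^{n_i+N} \ge \overline{X}_\alpha^{n_\alpha}\right) &\le 2\exp\left(-\varphi(n_i)\left(\overline{X}_\alpha^{n_\alpha} - \overline{X}_i^{n_i}\right)^2 n_i\right)
\end{aligned} \qquad (3)$$

where $\varphi(n) = 2\left(\frac{1 + n/N}{1 + \sqrt{n/N}}\right)^2 > 1.37$.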
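
The constant 1.37 can be checked numerically. The sketch below evaluates $\varphi(n) = 2\left(\frac{1 + n/N}{1 + \sqrt{n/N}}\right)^2$ on a grid of ratios $n/N$; the grid range and step are arbitrary choices made for illustration. Setting the derivative with respect to $\sqrt{n/N}$ to zero gives a minimum at $\sqrt{n/N} = \sqrt{2} - 1$ with value $24 - 16\sqrt{2} \approx 1.3726$, which agrees with the bound.

```python
import math

def phi(n, budget):
    """phi(n) = 2 * ((1 + n/N) / (1 + sqrt(n/N)))**2 with N = budget."""
    ratio = n / budget
    return 2.0 * ((1.0 + ratio) / (1.0 + math.sqrt(ratio))) ** 2

# Evaluate phi on ratios n/N in [0, 20] with step 0.001 (arbitrary grid).
ratios = [k / 1000.0 for k in range(0, 20001)]
smallest = min(phi(r, 1.0) for r in ratios)

# Closed-form minimum, attained at n/N = (sqrt(2) - 1)**2.
closed_form_minimum = 24.0 - 16.0 * math.sqrt(2.0)
```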

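Corollary 1. An upper bound on the VOI estimate $\Lambda_i^b$ is obtained by
substituting (3) into (2):

$$\begin{aligned}
\Lambda_\alpha^b &\le \hat\Lambda_\alpha^b = \frac{2N \overline{X}_\beta^{n_\beta}}{n_\alpha} \exp\left(-1.37\left(\overline{X}_\alpha^{n_\alpha} - \overline{X}_\beta^{n_\beta}\right)^2 n_\alpha\right) \\
\Lambda_{i|i\ne\alpha}^b &\le \hat\Lambda_i^b = \frac{2N \left(1 - \overline{X}_\alpha^{n_\alpha}\right)}{n_i} \exp\left(-1.37\left(\overline{X}_\alpha^{n_\alpha} - \overline{X}_i^{n_i}\right)^2 n_i\right)
\end{aligned} \qquad (4)$$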
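
The bounds of Corollary 1 are straightforward to compute from the sample means and sample counts. The following sketch is one possible way to do so, assuming rewards in [0, 1], at least two arms, and at least one sample per arm; the name `voi_upper_bounds` and the argument layout are hypothetical.

```python
import math

def voi_upper_bounds(means, counts, budget):
    """Upper bounds on the VOI of each arm, following Corollary 1.

    means[i] is the sample mean of arm i, counts[i] > 0 its number of
    samples, and budget the remaining number of samples N.
    """
    # alpha: arm with the highest sample mean; beta: the second-best arm.
    order = sorted(range(len(means)), key=lambda i: means[i], reverse=True)
    alpha, beta = order[0], order[1]
    bounds = []
    for i in range(len(means)):
        if i == alpha:
            gain = means[beta]                # numerator term for the best arm
            gap = means[alpha] - means[beta]
        else:
            gain = 1.0 - means[alpha]         # numerator term for other arms
            gap = means[alpha] - means[i]
        bounds.append(
            2.0 * budget * gain / counts[i]
            * math.exp(-1.37 * gap * gap * counts[i])
        )
    return bounds
```

For example, with sample means 0.6, 0.5 and 0.2, ten samples per arm and a remaining budget of 20, the largest bound belongs to the current best arm and the smallest to the arm whose mean is far below the best.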

4 VOI-based Sample Allocation 

Following the principles of rational metareasoning, for pure explo- 
ration in Multi-armed Bandits an arm with the highest VOI should 
be pulled at each step. The upper bounds established in Corollary 1
can be used as VOI estimates. In MCTS, pure exploration takes place 
at the first step of a rollout, where an action with the highest util- 
ity must be chosen. MCTS differs from pure exploration in Multi- 
armed Bandits in that the distributions of the rewards are not sta- 
tionary. However, VOI estimates computed as for stationary distribu- 
tions work well in practice. As illustrated by the empirical evaluation 
(Section 5), estimates based on upper bounds on the VOI result in a
rational sampling policy exceeding the performance of some state- 
of-the-art heuristic algorithms. 
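
The allocation rule for pure exploration in MAB can be sketched as follows for Bernoulli arms. This is an illustrative reading of the rule, not the implementation used in the experiments: pulling every arm once before the bounds are used, breaking ties by the lowest index, and all function names are assumptions made for the example.

```python
import math
import random

def voi_bound(i, alpha, beta, means, counts, remaining):
    """Upper bound of Corollary 1 for arm i, with N = remaining."""
    if i == alpha:
        gain, gap = means[beta], means[alpha] - means[beta]
    else:
        gain, gap = 1.0 - means[alpha], means[alpha] - means[i]
    return 2.0 * remaining * gain / counts[i] * math.exp(-1.37 * gap * gap * counts[i])

def voi_select(means, counts, remaining):
    """Choose the arm with the highest VOI upper bound (needs >= 2 arms)."""
    # Assumption: arms without samples are pulled first.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    order = sorted(range(len(means)), key=lambda i: means[i], reverse=True)
    alpha, beta = order[0], order[1]
    return max(
        range(len(means)),
        key=lambda i: voi_bound(i, alpha, beta, means, counts, remaining),
    )

def run_voi_policy(probs, budget, rng):
    """Spend `budget` pulls on Bernoulli arms with success rates `probs`.

    Returns the index of the empirically best arm and the pull counts.
    """
    k = len(probs)
    counts, sums = [0] * k, [0.0] * k
    for t in range(budget):
        means = [sums[i] / counts[i] if counts[i] else 0.0 for i in range(k)]
        arm = voi_select(means, counts, budget - t)
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    means = [sums[i] / counts[i] if counts[i] else 0.0 for i in range(k)]
    best = max(range(k), key=lambda i: means[i])
    return best, counts
```

A call such as `run_voi_policy([0.9, 0.5, 0.1], 60, random.Random(0))` returns the selected arm together with the number of pulls spent on each arm.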

5 Empirical Evaluation 
5.1 Selecting The Best Arm 
The sampling policies are first compared on random Multi-armed
bandit problem instances. Figure 1 shows results for randomly generated
Multi-armed bandits with 32 Bernoulli arms, with the
mean rewards of the arms distributed uniformly in the range [0, 1],
for a range of sample budgets 32..1024, with a multiplicative step of 2.
The experiment for each number of samples was repeated 10000
times. UCB1 is always considerably worse than the VOI-aware sampling
policy.

5.2 Playing Go Against UCT 

The policies were also compared on Computer Go, a search domain 
in which UCT-based MCTS has been particularly successful [5]. A
modified version of Pachi [2], a state-of-the-art Go program, was
used for the experiments. The UCT engine was extended with a VOI- 
aware sampling policy, and a time allocation mode ensuring that both 
the original UCT policy and the VOI-aware policy use the same average
number of samples per node was added. (While the UCT engine
is not the most powerful engine of Pachi, it is still a strong player; on
the other hand, additional features of more advanced engines would
obscure the MCTS phenomena which are the subject of the experiment.)
The engines were compared on the 9x9 board, for 5000, 7000,
10000, and 15000 samples; each experiment was repeated
1000 times. Figure 2 shows the winning rate of VOI against UCT vs.
the number of samples. For most numbers of samples per node, VOI 
outperforms UCT. 

6 Summary and Future Work 

This work suggested a Monte-Carlo sampling policy in which sam- 
ple selection is based on upper bounds on the value of information. 
Empirical evaluation showed that this policy outperforms heuristic 
algorithms for pure exploration in MAB, as well as for MCTS. 

MCTS still remains a largely unexplored field of application of 
VOI-aware algorithms. More elaborate VOI estimates, taking into
consideration the re-use of samples in future search states, should be
considered. The policy introduced in the paper differs from the UCT
algorithm only at the first step, where the VOI-aware decisions are 
made. Consistent application of principles of rational metareasoning 
at all steps of a rollout may further improve the sampling. 